Mining atomic Chinese abbreviations with a probabilistic single character recovery model

Authors

  • Jing-Shin Chang
  • Wei-Lun Teng
Abstract

This paper proposes an HMM-based single character recovery (SCR) model for extracting a large set of atomic abbreviations and their full forms from a text corpus. An "atomic abbreviation" here means an abbreviated word consisting of a single Chinese character. The task is important because Chinese abbreviations cannot be enumerated exhaustively, yet the abbreviation process for compound words appears to be compositional: one can often decode an abbreviated word character by character into its full form. With a large atomic abbreviation dictionary, multi-character abbreviations may therefore be easier to handle by exploiting this compositional property.
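The character-by-character decoding idea can be illustrated with a generic Viterbi search over an HMM in which each observed character is an abbreviation and each hidden state is a candidate full-form word. This is only a minimal sketch under stated assumptions, not the authors' SCR model: the function name `recover_full_form`, the candidate lists, and every probability below are hypothetical toy values chosen for illustration.

```python
import math

def recover_full_form(chars, candidates, emit, trans, start):
    """Viterbi decoding: return the most probable sequence of full-form words.

    chars      -- observed abbreviation, one character per position
    candidates -- hypothetical map: character -> possible full-form words
    emit       -- hypothetical P(character | word), keyed by (character, word)
    trans      -- hypothetical P(word | previous word), keyed by (prev, word)
    start      -- hypothetical P(word) for the first position
    """
    # best[w] = (log-probability, path) of the best path ending in word w
    best = {}
    for w in candidates[chars[0]]:
        best[w] = (math.log(start[w]) + math.log(emit[(chars[0], w)]), [w])
    for c in chars[1:]:
        nxt = {}
        for w in candidates[c]:
            # Unseen transitions get a small floor probability (an assumption).
            score, path = max(
                (prev_score + math.log(trans.get((p, w), 1e-6)), prev_path)
                for p, (prev_score, prev_path) in best.items()
            )
            nxt[w] = (score + math.log(emit[(c, w)]), path + [w])
        best = nxt
    return max(best.values())[1]

# Toy model (all values invented for illustration only).
candidates = {"台": ["台灣", "台北"], "大": ["大學", "大會"]}
start = {"台灣": 0.6, "台北": 0.4}
emit = {("台", "台灣"): 0.5, ("台", "台北"): 0.5,
        ("大", "大學"): 0.7, ("大", "大會"): 0.3}
trans = {("台灣", "大學"): 0.3, ("台北", "大學"): 0.2,
         ("台灣", "大會"): 0.05, ("台北", "大會"): 0.1}

print(recover_full_form("台大", candidates, emit, trans, start))
```

With these toy numbers, the two-character abbreviation 台大 decodes to the word sequence 台灣, 大學. In practice the probabilities would have to be estimated from a corpus, which this sketch does not attempt.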

Similar resources

Mining Atomic Chinese Abbreviation Pairs with a Probabilistic Single Character Word Recovery Model

An HMM-based Single Character Recovery (SCR) Model is proposed in this paper to extract a large set of “atomic abbreviation pairs” from a text corpus. By an “atomic abbreviation pair,” it refers to an abbreviated word and its root word (i.e., unabbreviated form) in which the abbreviation is a single Chinese character. This task is important since Chinese abbreviations cannot be enumerated exhaust...

Mining Atomic Chinese Abbreviation Pairs: A Probabilistic Model for Single Character Word Recovery

An HMM-based Single Character Recovery (SCR) Model is proposed in this paper to extract a large set of “atomic abbreviation pairs” from a large text corpus. By an “atomic abbreviation pair,” it refers to an abbreviated word and its root word (i.e., unabbreviated form) in which the abbreviation is a single Chinese character. This task is interesting since the abbreviation process for Chinese compo...

A Preliminary Study on Probabilistic Models for Chinese Abbreviations

Chinese abbreviations are widely used in modern Chinese texts. They are a special form of unknown words, including many named entities, which makes correct Chinese processing difficult. In this study, the Chinese abbreviation problem is regarded as an error recovery problem in which the suspect root words are the “errors” to be recovered from a set of candidates. Such a problem is ...

The use of probabilistic lexicality cues for word segmentation in Chinese reading.

In an eye-tracking experiment we examined whether Chinese readers were sensitive to information concerning how often a Chinese character appears as a single-character word versus the first character in a two-character word, and whether readers use this information to segment words and adjust the amount of parafoveal processing of subsequent characters during reading. Participants read sentences...

A Probabilistic Bayesian Classifier Approach for Breast Cancer Diagnosis and Prognosis

Medical diagnosis problems are the most effective component of treatment policies. Recently, significant advances have been made in medical diagnosis using data mining techniques. Data mining, or Knowledge Discovery, is the search of large databases to discover patterns and evaluate the probability of next occurrences. In this paper, Bayesian Classifier is used as a Non-linear dat...

Journal title:
  • Language Resources and Evaluation

Volume 40  Issue

Pages  -

Publication date 2006